Abstract: Commuting operations greatly simplify consistency in distributed systems. 
This paper focuses on designing for commutativity, a topic neglected previously. We show 
that the replicas of any data type for which concurrent operations commute converge to 
a correct value, under some simple and standard assumptions. We also show that such a 
data type supports transactions with very low cost. We identify a number of approaches and 
techniques to ensure commutativity. We re-use some existing ideas (non-destructive updates 
coupled with invariant identification), but propose a much more efficient implementation. 
Furthermore, we propose a new technique, background consensus. We illustrate these ideas 
with a shared edit buffer data type. 
Designing a commutative replicated data type 

Keywords: Data replication, optimistic replication, commutative operations 
1 Introduction 

To share information, users located at several sites may concurrently update a common 
object, e.g., a text document. Each user operates on a separate replica (i.e., local copy) of 
the document. A well-studied example is co-operatively editing a shared text. 

As users make modifications, replicas diverge from one another. Operations initiated 
on some site propagate to other sites and are replayed there. Eventually every site executes 
every action. Even so, if sites execute them in different orders, their replicas might still not 
converge. Various solutions are available in the literature; for instance, serialising the actions 
[7] or operational transformation [22]. Such designs are usually complex and non-scalable; 
thus, despite an extensive literature, there is still no satisfactory solution to the shared text 
editing problem. 

We suggest a different approach: design replicated data types such that operations 
commute with one another. Let us call such a type a commutative replicated data type or 
CRDT. CRDT replicas provably converge. Furthermore, CRDTs support transactions "for 
free." However, designing a non-trivial CRDT is difficult. 

Although the advantages of commutativity are well known, the problem of designing 
data types for commutativity has been neglected. Recently, Oster et al. proposed a repli- 
cated character buffer CRDT called WOOT [14]. WOOT operations commute, because 
updates are non-destructive, and because the identity of a character does not change 
with concurrent edits. However, WOOT has some drawbacks: it wastes a lot of space, and 
it does not support block operations such as cut-and-paste. 

This paper presents the design of a non-trivial CRDT for concurrent editing, called 
treedoc. Since it is a CRDT, convergence is guaranteed. It supports block operations. Space 
overhead is kept to a minimum: there is little to no internal meta-data; deleted information 
can be forgotten; and identifiers are kept short. Common edit operations respond locally and 
suffer no network latency. Treedoc is fault-tolerant and supports disconnected operation. 

As in WOOT, ordinary editing operations are non-destructive and identification does 
not change with concurrent edits, but our implementation is very different from WOOT. 
The basic treedoc structure is a binary tree of atoms. The path to a node is a bitstring. For 
efficiency, structural operations switch between a flat buffer and a tree. These operations 
are potentially non-commutative; to avoid this problem, structure changes rely either on 
common knowledge or on consensus. To avoid the latency associated with consensus, it 
occurs in the background (not in the critical path of editing operations) and aborts if it 
conflicts with an edit. 

In summary, the contributions of this paper are the following: 

• A design principle: concurrent operations should commute. We prove that any 
Commutative Replicated Data Type (CRDT) converges, under some simple and standard 
assumptions. 

• We identify two alternative approaches to commutativity: operation coalescing vs. 
precedence. Coalescing is better, but is not always possible. 

• The design of treedoc, a non-trivial, space-efficient, responsive, coalescing CRDT for 
distributed editing. 

• We identify some previously-published techniques for coalescing, such as non-destructive 
update and invariant identity. We propose a novel implementation of these techniques. 

• We propose a novel technique for coalescing: where a consensus is necessary, it is re- 
stricted to non-essential operations, occurs in the background, and aborts if it conflicts 
with an essential operation. 

• We show that a CRDT readily supports transactions at a very low implementation 
cost. 

The paper proceeds as follows. Section 1 is this introduction. We describe our system 
model in Section 2. Section 3 describes the shared buffer abstract data type. We suggest 
a simple implementation of this data type in Section 4. In Section 5, we examine how 
to convert to a more efficient representation and back. Section 6 explains the full treedoc 
implementation, combining the advantages of the two preceding sections. We build trans- 
actions on top of a CRDT in Section 7. Section 8 compares with previous work. Section 9 
concludes. We provide a proof of convergence in Appendix A. 

2 System model 

2.1 Replicated execution and eventual consistency 

We consider an asynchronous distributed system, consisting of N sites (computers) con- 
nected by a network. Communication between connected sites is reliable. A site may 
disconnect but eventually reconnects. We assume an epidemic style of communication, i.e., 
a site connects at arbitrary intervals with arbitrary other sites, sending both local updates 
and those previously received from other sites. Eventually, every update reaches every site, 
either directly or indirectly. 

With no loss of generality, we consider a single object replicated at any number of sites. 
A user accesses the object through his local replica, initiating operations at the current 
site.^ The operations execute locally and are logged. Eventually the log is transmitted and 
the operations it contains are replayed at other sites. Eventually all sites execute the same 
operations (either by local submission or by remote replay), in some sequential order, but 
not necessarily in the same order. 

^We assume that a given operation is initiated at a unique site. 
We say operation o happens before o' (noted $o \to o'$) if some site initiates o' after the 
same site has executed o.^ We require that if $o \to o'$, then all sites execute o before o' (not 
necessarily immediately before). Common epidemic protocols, such as Bayou's anti-entropy 
[16], ensure this so-called "causal ordering" property. To implement this property, it suffices 
to delay the execution of some operation o, until all operations that happen before o have 
been executed. Well-known techniques such as vector clocks or version vectors [10] can be 
used to track happens-before dependencies. 
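
The delaying rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of Bayou or of this paper: the names `Op` and `CausalReceiver` are hypothetical, and we assume that each operation carries the vector clock of its initiator, counting the operation itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    site: int        # index of the initiating site (assumed convention)
    clock: tuple     # initiator's vector clock, counting this operation
    payload: str


class CausalReceiver:
    """Delays a remote operation until all operations that happen before it ran."""

    def __init__(self, n_sites):
        self.delivered = [0] * n_sites   # operations executed so far, per initiator
        self.pending = []                # received but not yet executable
        self.log = []                    # payloads, in local execution order

    def _ready(self, op):
        for s, count in enumerate(op.clock):
            if s == op.site:
                # Must be the next operation from its own initiator.
                if count != self.delivered[s] + 1:
                    return False
            elif count > self.delivered[s]:
                # Depends on an operation of site s not yet executed here.
                return False
        return True

    def receive(self, op):
        self.pending.append(op)
        progress = True
        while progress:                  # executing one op may unblock others
            progress = False
            for p in list(self.pending):
                if self._ready(p):
                    self.pending.remove(p)
                    self.delivered[p.site] += 1
                    self.log.append(p.payload)
                    progress = True


# Site 0 initiates a; site 1 executes a and then initiates b, so a happens before b.
a = Op(site=0, clock=(1, 0, 0), payload="a")
b = Op(site=1, clock=(1, 1, 0), payload="b")
receiver = CausalReceiver(n_sites=3)
receiver.receive(b)          # b arrives first and is delayed
log_after_b = list(receiver.log)
receiver.receive(a)          # a executes, which unblocks b
```

Concurrent operations are not ordered by this rule, so two receivers may execute them in different orders; this is precisely where commutativity is needed.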

Operations are concurrent if neither happens before the other: 
$o \parallel o' \stackrel{\mathrm{def}}{=} \neg(o \to o') \land \neg(o' \to o)$. 

Two operations o and o' commute iff, whatever the current state of the object, such that 
execution of either o or o' succeeds, executing o immediately followed by o' also succeeds, 
and leads to the same state as executing o' immediately followed by o. 

A Commutative Replicated Data Type (CRDT) is a data type where all concurrent 
operations commute with one another. We prove (in Appendix A) that CRDTs guarantee 
eventual consistency: provided that every site executes every operation in an order consistent 
with happens-before, the final state of replicas is identical at all sites. 

Furthermore, CRDTs support serialisable transactions with virtually no overhead. If all 
operations commute, so do arbitrary sets of operations. If every site executes transactions 
sequentially, in an order consistent with happens-before, the local orders are all equivalent. 
This ensures serialisability. Furthermore, transactions never abort. Hence, very little 
mechanism is needed. We return to transactions in Section 7. 

2.2 Ensuring that operations commute 

Two operations $\alpha$ and $\beta$ commute if, for any state T, execution sequences $(T \cdot \alpha \cdot \beta)$ and 
$(T \cdot \beta \cdot \alpha)$ are both correct states and are equivalent. There are two basic approaches for 
ensuring commutativity, which we call coalescing and precedence. 

Intuitively, coalescing means that $\alpha$ preserves the effect of $\beta$ and vice-versa; i.e., the 
post-condition of both operations is satisfied, whatever their relative execution order.^ This is the 
standard meaning in mathematics, for instance when we say that addition and subtraction 
of integers commute. 

The alternative is to define an order of precedence, say $\beta$ takes precedence over $\alpha$: when 
both operations execute in either order, $\beta$ takes effect, but not necessarily $\alpha$. A typical 
implementation is that in the order $(\alpha \cdot \beta)$, the latter overwrites the results of the former, 
and in the order $(\beta \cdot \alpha)$, the latter is replaced by a no-op. For instance, most replicated file 
systems follow the "Last Writer Wins" rule [19]: when two users write to the same file, the 
write with the highest timestamp takes precedence. The write with the lowest timestamp 
may be lost. (In contrast, writes to different files coalesce.) 

^The site executes o either because it was initiated locally, or because it was initiated at another site and 
delivered here in a message. Thus the $\to$ relation is identical to Lamport's happens-before [7]. 
^This is sometimes called "intention preservation." 

Clearly, the coalescing approach is preferable to precedence; but precedence is much 
easier to achieve. 
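
The contrast between the two approaches can be illustrated with a small sketch. The helper names (`add`, `lww_write`, `run`) are hypothetical and chosen only for this example: integer addition stands for coalescing, and a timestamped register stands for the "Last Writer Wins" rule.

```python
def add(n):
    """Coalescing: the effect of every addition survives in the final state."""
    return lambda state: state + n


def lww_write(value, timestamp):
    """Precedence: the state is a (value, timestamp) pair; the highest timestamp wins."""
    def apply(state):
        # In the order (old, new) the newer write overwrites the older one;
        # in the order (new, old) the older write degenerates into a no-op.
        return (value, timestamp) if timestamp > state[1] else state
    return apply


def run(state, *ops):
    """Execute the operations in the given order and return the final state."""
    for op in ops:
        state = op(state)
    return state


# Coalescing: both orders give the same state, and both effects are kept.
sum_ab = run(10, add(5), add(-3))
sum_ba = run(10, add(-3), add(5))

# Precedence: both orders give the same state, but the write of "x" is lost.
w1, w2 = lww_write("x", 1), lww_write("y", 2)
reg_ab = run(("", 0), w1, w2)
reg_ba = run(("", 0), w2, w1)
```

In both cases the replicas converge; only in the first case are both users' updates reflected in the result.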

3 Shared buffer replicated data type 

We consider a shared, replicated document, consisting of a linear sequence of atoms. An 
atom may be a character or some other immutable payload, e.g., a graphical illustration 
inserted inside the document. We designed treedoc to be as unrestricted and flexible as 
possible, to enable a variety of applications to use it. 

Each user has a copy of the document. Each user can modify his replica independently 
by executing two types of edit operations: 

• insert(insertpos, newatom, S) visibly inserts atom newatom in the document. (In the 
underlying data structure, there may be other data before or after insertpos, but it is 
not visible to the user.) All atoms at positions strictly less than insertpos lie to the left 
of newatom; all those strictly greater than insertpos lie to its right. The S argument 
is the initiating site, as justified later. 

• delete(delpos, S) visibly removes the atom existing at position delpos. (The atom may 
still be in the data structure but is not visible to the user any more.) The S argument 
is the initiating site. 

We defer to Section 7 the description of a transaction construct, enabling atomic bulk 
operations such as cutting and pasting a block of text, or searching and replacing all instances 
of a pattern. 

Our treedoc design ensures commutativity by coalescence, i.e., the effect of insert and 
delete is the same at all sites. 

At a higher level, the application might have stronger requirements. For instance, it 
might disallow inserting characters inside deleted text; or it might ensure a proper hierarchy 
of chapters, sections and paragraphs; etc. Enforcing such semantic constraints requires 
higher-level conflict detection and resolution mechanisms, which are non-commutative and 
outside our focus. 

Similarly, we emphasise that the stringtree structure is completely decoupled from any 
higher-level document hierarchy (e.g., XML tree structure). 
3.1 Unique identifiers for positions 

Let us assume the existence of unique position identifiers, with the following properties: 

• Each position in the atom buffer has an identifier that is unique with respect to all 
other positions, and that remains constant, for the whole lifetime of the document.^ 

• There is a total order of position identifiers, noted <. 

• Given any two identifiers L and R such that L < R, it is possible to generate a fresh 
unique identifier N such that L < N < R. 

We will call these identifiers UIDs (unique identifiers). Real numbers have the properties 
required for UIDs, but the third property requires infinite precision, which is not realistic. 
In Section 4 we will present a practical alternative, based on trees. 
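
To make the third property and its cost concrete, the sketch below uses exact rationals (Python's `fractions.Fraction`) in place of real numbers. It is an illustration only, not the scheme proposed in this paper: `fresh_between` is a hypothetical helper, and it ignores the case where two sites concurrently pick the same midpoint.

```python
from fractions import Fraction


def fresh_between(left, right):
    """Return an identifier N with left < N < right (here, the midpoint)."""
    if not left < right:
        raise ValueError("left must be strictly smaller than right")
    return (left + right) / 2


midpoint = fresh_between(Fraction(0), Fraction(1))

# Repeatedly inserting just after the left end halves the gap each time,
# so the identifier needs one more bit of precision per insert.
left, right = Fraction(0), Fraction(1)
for _ in range(10):
    right = fresh_between(left, right)
```

After ten such inserts the denominator has grown to 2^10; with unbounded editing the identifiers grow without bound, which is the sense in which real-number identifiers are not realistic.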



3.2 Abstract atom buffer CRDT 



Consider an abstract data type whose state T is a set of (uid, atom) couples, where uids 
are unique. The content of state T is the sequence of all atoms in T ordered by their 
uid. Operation insert(u, a, S) adds the pair (u, a) to the set. If a pair (u, a) exists in the 
set, operation delete(u, S) removes the pair, whatever a. We now prove that concurrent 
operations of this data type commute. 

Lemma 1. Insert operations commute. For any data state T, any fresh unique identifiers 
$u_1$ and $u_2$, any atoms $a_1$ and $a_2$, and any originating sites $S_1$ and $S_2$: 
$(T \cdot \mathit{insert}(u_1, a_1, S_1) \cdot \mathit{insert}(u_2, a_2, S_2)) = (T \cdot \mathit{insert}(u_2, a_2, S_2) \cdot \mathit{insert}(u_1, a_1, S_1))$. 

Proof. After executing the two insert operations, the resulting state includes the two new 
atoms. Furthermore, atoms are ordered by unique identifiers. Therefore, the final state is 
the same. □ 

Lemma 2. An insert operation commutes with a delete operation when they refer to different 
unique identifiers. For any state T, any fresh unique identifier $u_1$, any unique identifier $u_2 \neq u_1$, 
any atom $a_1$, and any originating sites $S_1$ and $S_2$: 
$(T \cdot \mathit{insert}(u_1, a_1, S_1) \cdot \mathit{delete}(u_2, S_2)) = (T \cdot \mathit{delete}(u_2, S_2) \cdot \mathit{insert}(u_1, a_1, S_1))$. 

Proof. Two cases must be considered. First, when T includes the atom with identifier $u_2$. 
By executing both operations in any order, the final state of T will include an additional 
atom identified by $u_1$ and it will not include the atom identified by $u_2$. As atoms are ordered 
by their unique identifier, the final state is the same. Second, when T does not include the 
atom with identifier $u_2$. By executing both operations in any order, the final state of T 
will include an additional atom identified by $u_1$ (the atom identified by $u_2$ was not present 
in the original state). As atoms are ordered by their unique identifier, the final state is the 
same. □ 

^However, an unused identifier can be garbage-collected and re-used. We do not attempt to formalise 
this property. 
Lemma 3. If an insert operation and a delete operation refer to the same unique identifier, 
then the insert happens-before the delete. 

Proof. According to the specification of Section 3, a user may initiate operation delete(u, S) 
at site S only if a pair (u, a) exists (for some a) in the current state at site S. This pair 
must have been inserted by an insert operation executed previously at site S. □ 

Lemma 4. Delete operations commute. For any state T, any unique identifiers $u_1$ and $u_2$ 
and any originating sites $S_1$ and $S_2$: 
$(T \cdot \mathit{delete}(u_1, S_1) \cdot \mathit{delete}(u_2, S_2)) = (T \cdot \mathit{delete}(u_2, S_2) \cdot \mathit{delete}(u_1, S_1))$. 

Proof. For any original state T, the final state will not include the atoms identified by $u_1$ 
and $u_2$, but it will include all other atoms, as no other atom will ever have the same unique 
identifier. Thus, the final state will include the same set of atoms and, as atoms are ordered 
by their unique identifier, the final state is exactly the same. □ 

Theorem 1. The data type described in this section is a CRDT. 

Proof. By the above lemmas, all concurrent operation pairs (insert-insert, delete-delete, 
insert-delete) commute. □ 
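
The abstract data type and the lemmas can be checked on a small instance. The sketch below is an assumption-laden illustration rather than the treedoc implementation: the class `AtomBuffer` and helper `replay` are hypothetical names, uids are plain numbers, and the site argument is carried but unused, as in the abstract specification.

```python
from itertools import permutations


class AtomBuffer:
    """State T: a set of (uid, atom) pairs, stored as a dict keyed by uid."""

    def __init__(self, pairs=()):
        self.pairs = dict(pairs)

    def insert(self, uid, atom, site):
        self.pairs[uid] = atom          # uid is assumed fresh and unique

    def delete(self, uid, site):
        self.pairs.pop(uid, None)       # removes the pair, whatever the atom

    def content(self):
        # The content is the sequence of atoms ordered by uid.
        return [self.pairs[uid] for uid in sorted(self.pairs)]


def replay(initial, ops):
    """Apply the operations, in the given order, to a fresh replica."""
    replica = AtomBuffer(initial)
    for name, *args in ops:
        getattr(replica, name)(*args)
    return replica.content()


initial = [(1, "a"), (2, "b"), (3, "c")]
# Four mutually concurrent operations on distinct uids.
concurrent_ops = [
    ("insert", 1.5, "X", "s1"),
    ("insert", 2.5, "Y", "s2"),
    ("delete", 2, "s1"),
    ("delete", 3, "s2"),
]
# Replay every one of the 24 possible execution orders.
results = {tuple(replay(initial, order)) for order in permutations(concurrent_ops)}
```

Every execution order yields the same content, as Theorem 1 predicts. An insert and a delete of the same uid are not included in the permutation, since by Lemma 3 they are never concurrent.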

4 Treedoc abstract data type 

We start with a very simple design, which satisfies the coalescence requirement, but has 
some limitations. In later sections, we will improve the design. 

4.1 Paths 

We manage the document as a binary tree. A tree node contains either a single atom, or nil. 
The identifier of an atom is its path in the tree. The path to the root is the empty bitstring 
$\epsilon$; the path concatenation operator is noted $\oplus$. The left child of a node is 0, its right child 
is 1. We walk the tree in infix order, skipping nil nodes (but not their descendants). 

For example, Figure 1 represents the document state "abcdef", with the following identifiers: 
id(a) = [00]; id(b) = [0]; id(c) = []; id(d) = [10]; id(e) = [1]; id(f) = [11]. 

We define the following partial order over identifiers. Node $id_1$ is to the left of $id_2$ (or, 
equivalently, $id_2$ is to the right of $id_1$), noted $id_1 < id_2$, iff: 

• $id_1 = [c_1 \ldots c_n]$ is a prefix of $id_2 = [c_1 \ldots c_n j_1 \ldots j_m]$ and $j_1 = 1$, or 

• $id_2 = [c_1 \ldots c_n]$ is a prefix of $id_1 = [c_1 \ldots c_n i_1 \ldots i_m]$ and $i_1 = 0$, or 

• $id_1 = [c_1 \ldots c_n i_1 \ldots i_n]$ has a common prefix with $id_2 = [c_1 \ldots c_n j_1 \ldots j_m]$, and $i_1 = 0$ 
and $j_1 = 1$. The prefix may be empty. 
Figure 1: Identifiers in a shared text buffer 

We also define the ancestry of a node. Node u is the (direct) parent of node v, noted 
$u / v$, iff $id(v) = id(u) \oplus 0 \lor id(v) = id(u) \oplus 1$; equivalently, v is a (direct) child of u. Node 
u is an ancestor of v (or, equivalently, v is a descendant of u), noted $u /^{+} v$, if u is a parent, 
or grand-parent, or great-grand-parent, etc., of v. 
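
The order and ancestry relations can be sketched directly on bitstrings. In this illustrative sketch, paths are Python strings of "0" and "1" (the empty string being the root); the function names `is_left_of`, `is_ancestor` and `compare` are hypothetical.

```python
from functools import cmp_to_key


def is_left_of(id1, id2):
    """True iff id1 < id2 in the infix order of the tree."""
    n = 0
    while n < min(len(id1), len(id2)) and id1[n] == id2[n]:
        n += 1                       # skip the longest common prefix
    if n == len(id1) and n == len(id2):
        return False                 # same identifier
    if n == len(id1):
        return id2[n] == "1"         # id1 is a prefix: id2 is in its right subtree
    # Either id2 is a prefix of id1, or the paths diverge after the prefix:
    # in both cases id1 is on the left iff its next bit is 0.
    return id1[n] == "0"


def is_ancestor(u, v):
    """True iff u is a strict ancestor of v, i.e. a strict prefix of v."""
    return u != v and v.startswith(u)


def compare(id1, id2):
    if id1 == id2:
        return 0
    return -1 if is_left_of(id1, id2) else 1


# The identifiers of Figure 1.
figure1 = {"00": "a", "0": "b", "": "c", "10": "d", "1": "e", "11": "f"}
text = "".join(figure1[p] for p in sorted(figure1, key=cmp_to_key(compare)))
```

Sorting the identifiers of Figure 1 with this order recovers the document state in reading order.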

4.2 Deleting 

We start with the simplest procedure, deleting an atom: simply replace the content of the 
node with nil. Since the identification of the deleted node is unique, it is clear that the 
initiator and replay executions will all delete the same node. Sometimes, during replay, 
the node to be deleted may not exist, but this can only be because it was already deleted 
previously. 

We will say that the delete is stable once it has been executed at all sites. No operation 
that happens-after the delete is stable will ever refer to the node identifier; therefore, if the 
node is a leaf, it can be completely forgotten (and so on recursively). Thus a subtree that 
contains only stably deleted nodes can be completely removed and forgotten. 

To this effect we introduce a gc(N) procedure that removes leaf N if it is stably deleted. 
A site may call gc(N) at any time after N is deleted. Operation gc is local only; it does 
not have a replay version. 

Procedure stabledel(N) tests for stability. Conceptually, stabledel(N) waits for acknowledgments 
from all sites that have executed the delete. We refer to Golding [4] for an efficient 
implementation of stability that compacts acknowledgments for all past operations into a 
single vector clock or matrix clock. 
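
A minimal sketch of the local garbage-collection step follows. It assumes, for illustration only, that the tree is a Python dict from path to atom (with `None` standing for nil) and that stability information is already available as a set `stably_deleted`; how that set is computed (acknowledgments, vector or matrix clocks) is left out.

```python
def gc(tree, path, stably_deleted):
    """Remove the leaf at `path` if it is nil and stably deleted, then
    retry on its parent, which may have become a removable leaf."""
    while path in tree and tree[path] is None and path in stably_deleted:
        if any(p != path and p.startswith(path) for p in tree):
            return                   # not a leaf: descendants still need it
        del tree[path]
        if path == "":
            return                   # the root has no parent
        path = path[:-1]


# Nodes [1] and [10] are nil and stably deleted: the whole subtree disappears.
tree = {"": "c", "0": "b", "1": None, "10": None}
gc(tree, "10", stably_deleted={"1", "10"})

# A nil node whose delete is not yet stable must be kept.
unstable = {"": "c", "1": None}
gc(unstable, "1", stably_deleted=set())

# A nil interior node with a live descendant must be kept as well.
interior = {"": "c", "1": None, "11": "f"}
gc(interior, "1", stably_deleted={"1"})
```

Because gc is purely local, different sites may hold different sets of nil nodes at a given time without affecting the visible content.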

4.3 Implementing inserts 

To insert newatom between $atom_p$ and $atom_f$, we must grow the tree in a way that satisfies 
the relation $id(atom_p) < id(newatom) < id(atom_f)$. In this section, we present a very 
simple algorithm that does not attempt to balance the tree; later, we will resolve this issue. 
Algorithm 1 New unique identifier for insert 
function newUID(($atom_p$, $uid_p$), ($atom_f$, $uid_f$)) 
    // ($atom_p$, $uid_p$): previous atom; ($atom_f$, $uid_f$): following atom 
    Require: $uid_p < uid_f$ 
    if $\exists (atom_m, uid_m) : uid_p < uid_m < uid_f$ then 
        return newUID(($atom_p$, $uid_p$), ($atom_m$, $uid_m$)) 
    else if $(atom_p, uid_p) /^{+} (atom_f, uid_f)$ then return $uid_f \oplus 0$ 
    else if $(atom_f, uid_f) /^{+} (atom_p, uid_p)$ then return $uid_p \oplus 1$ 
    else return $uid_p \oplus 1$ 



Algorithm 1 starts by checking whether there is a node between $atom_p$ and $atom_f$. If 
so, it recursively looks for the leftmost predecessor of $uid_f$ that remains to the right of $uid_p$. 
When there is no node between $atom_p$ and $atom_f$, three cases may occur: 

• Node $uid_p$ is an ancestor of $uid_f$, i.e., $uid_f$ is a right descendant of $uid_p$. In this case, 
$uid_f$ has no left child, so we create a new left child of node $uid_f$. The new identifier 
is $uid_f \oplus 0$. 

• Or, symmetrically, node $uid_f$ is an ancestor of $uid_p$. The new identifier is $uid_p \oplus 1$. 

• Or, neither is an ancestor of the other. In this case, $uid_p$ has no right child, so we 
create a new right child of node $uid_p$, identified by $uid_p \oplus 1$. 

In the example of Figure 1, for inserting atom X between c and d, a left child is created 
under d with identifier [100], as shown in Figure 2. 
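
Algorithm 1 can be sketched in a few lines for the sequential case (no disambiguators). The representation is an assumption of this example: paths are strings of "0" and "1", and `nodes` is the set of all paths currently in the tree, including nil nodes. The helper names are hypothetical.

```python
def is_left_of(id1, id2):
    """Infix order on bitstring paths (see Section 4.1)."""
    n = 0
    while n < min(len(id1), len(id2)) and id1[n] == id2[n]:
        n += 1
    if n == len(id1) and n == len(id2):
        return False
    if n == len(id1):
        return id2[n] == "1"
    return id1[n] == "0"


def is_ancestor(u, v):
    return u != v and v.startswith(u)


def new_uid(nodes, uid_p, uid_f):
    """Return a fresh path strictly between uid_p and uid_f."""
    if not is_left_of(uid_p, uid_f):
        raise ValueError("uid_p must be to the left of uid_f")
    between = [m for m in nodes if is_left_of(uid_p, m) and is_left_of(m, uid_f)]
    if between:
        # Some node lies in between: narrow the interval and retry.
        return new_uid(nodes, uid_p, between[0])
    if is_ancestor(uid_p, uid_f):
        return uid_f + "0"           # new left child of the following node
    # uid_f is an ancestor of uid_p, or neither is an ancestor of the other:
    # in both cases, create a new right child of the previous node.
    return uid_p + "1"


figure1_nodes = {"00", "0", "", "10", "1", "11"}
between_c_and_d = new_uid(figure1_nodes, "", "10")
between_b_and_e = new_uid(figure1_nodes, "0", "1")
```

On the tree of Figure 1, inserting between c and d yields [100], as in the text. Asking for a position between b and e first narrows the interval to (b, c) and then returns [01], a new right child of b.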




Figure 2: Identifiers after inserting a new character atom 



4.4 Concurrent inserts 

In case of concurrent updates, a binary tree becomes insufficient, because two users can con- 
currently insert an atom at the same position. We maintain the basic binary tree structure, 
but we extend a node to contain any number of side nodes (and their descendants), disambiguated 
by a (siteID, counter) pair, where siteID identifies the initiator site. We assume a 
total order of site identifiers, hence of disambiguators, hence of side nodes: $(s_1, c_1) < (s_2, c_2)$ 
iff $c_1 < c_2$, or $c_1 = c_2$ and $s_1 < s_2$. 

Algorithm 1 generates new unique identifiers for insertion. Figure 3 shows an example 
of the situation. Assuming that characters X and Y were inserted with the associated 
disambiguators idX and idY respectively, we have id(X) = [(1)(0)(0, idX)] and 
id(Y) = [(1)(0)(0, idY)]. 


Figure 3: Side nodes and their identifiers, after concurrent inserts at position [100] 

Since a concurrent update can occur at every level, conceptually, every node may 
include a disambiguator. Thus, in the example, we could have id(d) = 
[(-, idC)(1, idE)(0, idD)]. The full state is as in Figure 4. 



Figure 4: Fully expanded identifiers, after concurrent inserts at position [100] 

When disambiguators are used, the total order among identifiers is defined as follows: 
Node $id_1$ is left of node $id_2$ (or, equivalently, $id_2$ is right of $id_1$), noted $id_1 < id_2$, iff: 

• $id_1 = [c_1 \ldots c_n]$ is a prefix of $id_2 = [c_1 \ldots c_n j_1 \ldots j_m]$ and $j_1 = (1, *)$, or 

• $id_2 = [c_1 \ldots c_n]$ is a prefix of $id_1 = [c_1 \ldots c_n i_1 \ldots i_m]$ and $i_1 = (0, *)$, or 

• $id_1 = [c_1 \ldots c_n i_1 \ldots i_n]$ has a common prefix with $id_2 = [c_1 \ldots c_n j_1 \ldots j_m]$, and either 
$i_1 = (0, *) \land j_1 = (1, *)$, or $i_1 = (k, d_1) \land j_1 = (k, d_2) \land d_1 < d_2$. The prefix may be empty. 

To generate a new unique identifier, Algorithm 1 is used, with the returned identifier 
extended with a freshly generated disambiguator. 
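
The order with disambiguators can be sketched as below. The representation is an assumption made for this example only: an identifier is a tuple of (bit, disambiguator) elements in fully expanded form, the root element is omitted, and a disambiguator is stored as a (counter, siteID) tuple so that Python's tuple comparison coincides with the order on disambiguators given above.

```python
def is_left_of(id1, id2):
    """Total order on identifiers whose elements are (bit, disambiguator)."""
    n = 0
    while n < min(len(id1), len(id2)) and id1[n] == id2[n]:
        n += 1                               # skip the common prefix
    if n == len(id1) and n == len(id2):
        return False                         # same identifier
    if n == len(id1):
        return id2[n][0] == 1                # id1 is a prefix, id2 goes right
    if n == len(id2):
        return id1[n][0] == 0                # id2 is a prefix, id1 goes left
    (bit1, dis1), (bit2, dis2) = id1[n], id2[n]
    if bit1 != bit2:
        return bit1 == 0                     # id1 goes left, id2 goes right
    return dis1 < dis2                       # same bit: compare disambiguators


base = (0, "s0")                             # hypothetical disambiguator of old nodes
root_c = ()
node_d = ((1, base), (0, base))
# Two sites A and B concurrently insert X and Y at position [100].
node_x = node_d + ((0, (1, "A")),)
node_y = node_d + ((0, (1, "B")),)
```

With equal counters, the site identifier breaks the tie, so every site orders X before Y, and both fall between c and d.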

Once all concurrent inserts at the same location have executed at some site, redundant 
disambiguators can be removed. We will say that an insert is stable at some site, once that 
site has received from all other sites some operation that happens-after the insert. At that 
point, it is guaranteed not to receive another concurrent insert at the same node. (To ensure 
this happens quickly, sites that are not actively editing should send out occasional no-ops.) 
Procedure cleanside removes redundant disambiguators; it is a local procedure (it has no 
replay version). 

4.5 Treedoc abstract data type 

Algorithm 2 contains a detailed specification of the simple treedoc data type. In addition to 
the operations given at the beginning of Section 3, we specify gc and cleanside as explained 
above. 

The initiator versions of insert and delete have pre-conditions, to make sure that the 
user only addresses valid nodes, and to avoid wastage of space. However, there may be no 
restrictions on the replay version. Therefore, the replay version has no precondition, and 
simply re-creates any nodes that it may be missing. 

This data type satisfies the coalescence requirement. Since every atom has an identifier 
that does not change with other operations, replaying a delete removes the intended atom. 
Since the path of an inserted atom positions it with respect to its left and right neighbours, 
replaying an insert preserves the intended location. Since this data type is a CRDT, replicas 
are guaranteed to converge to the same (correct) value. 

In Algorithm 2, the notation N[siteID] stands for the side node of N identified by siteID. 
The notation N[[siteID]] stands for N[siteID], if it exists, and N otherwise. IamInitiator is 
true on the initiator site, and false on all replay sites. 

5 Identifying a sequential buffer to a binary tree 

The approach so far has a number of limitations. Paths are variable length and can become 
very inefficient if the tree is unbalanced (e.g., if users always append to the end of the buffer). 
Concurrent inserts complicate the structure and the paths. The tree metadata consumes 
memory; for instance, if atoms are one-byte characters the overhead can be several times 
the payload. Finally, deleted atoms that cannot be garbage-collected waste space. 

Algorithm 2 Simple treedoc 
procedure contents(N)    // N: a treedoc node 
    Walk subtree rooted at N in infix order 
    Return the atoms of the non-empty nodes 

procedure insert(A, N, S) 
    // A: atom to insert 
    // N: insertion position, chosen by newUID 
    Require: S = the initiator site 
    Require: $A \neq$ nil 
    Require: IamInitiator $\Rightarrow$ all ancestors of node N exist 
    Require: IamInitiator $\Rightarrow$ N[[S]] does not exist 
    Create any missing ancestors of N 
    If necessary, create node N 
    Create side node n = N[S] with n.contains = A 

procedure delete(N, S)    // N: node to be deleted 
    Require: IamInitiator $\Rightarrow$ N[[S]] exists and N[[S]].contains $\neq$ nil 
    Require: S = the initiator site 
    Create any missing ancestors of N 
    If necessary, create node N 
    N[[S]].contains := nil 
    Send acknowledgment of delete(N, S) to all sites 

procedure stabledel(N)    // Await stable delete of node N 
    if acknowledgment for delete(N, *) received from all sites then return true 
    else return false 

procedure gc(N)    // N: a treedoc leaf 
    Require: N.contains = nil 
    Require: stabledel(N) 
    Require: IamInitiator 
    Remove N 

procedure cleanside(N)    // N: a treedoc node 
    Require: stableinsert(N) 
    Require: IamInitiator 
    if $|\{N[S] \mid \forall S\}| = 1$ then    // There is a single side node 
        N := N[S]    // Remove redundant disambiguator 

procedure stableinsert(N) 
    if current site has received some operation that happens-after insert(*, N, *) from every site then 
        return true 
    else return false 

Rather than attempt to fix each of these issues individually, we propose a more radical 
solution. In this section, we discuss structural operations that switch between the efficient 
flat buffer representation and the edit-oriented tree representation. The specification of 
these operations is as follows. 

• explode(atomstring). Returns a treedoc whose contents are identical to atomstring. 

• flatten(path). Returns an atom string whose contents are identical to the sub-treedoc 
rooted at path. 

Algorithm 3 explode and flatten 
procedure explode(atomstring) 
    depth = $\lceil \log_2(length(atomstring) + 1) \rceil$ 
    T = Allocate a complete binary tree of depth depth 
    Populate T in infix order with the atoms of atomstring 
    Remove any remaining nodes 
    Return T 

procedure flatten(N)    // N: root of a subtree to be flattened 
    Walk subtree in infix order 
    Return a linear buffer containing the atoms of the non-empty nodes 

The initiator and replay versions of these operations must have identical effect. In 
particular, explode must return exactly the same structure at all sites. 

Different implementations of explode are possible, as long as it has the same effect 
at every site. Observing that the capacity of a complete binary tree with depth levels is 
$2^{depth} - 1$, we suggest the simple implementation in Algorithm 3. 

With these two operations, we can choose the string representation or the treedoc rep- 
resentation. The former is compact and efficient, but it does not readily support concurrent 
edits. When a treedoc becomes unbalanced or contains many nil nodes, it suffices to flatten 
then explode it to fix the problem. 
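
The two structural operations can be sketched as follows, along the lines of Algorithm 3. The representation is an assumption of this example: a treedoc is a Python dict from bitstring path to atom, `None` stands for nil, and empty interior nodes are left implicit. The helpers `infix_paths` and `infix_key` are hypothetical.

```python
import math


def infix_paths(depth, prefix=""):
    """Yield the paths of a complete binary tree with `depth` levels, in infix order."""
    if depth == 0:
        return
    yield from infix_paths(depth - 1, prefix + "0")
    yield prefix
    yield from infix_paths(depth - 1, prefix + "1")


def explode(atomstring):
    """Map a flat atom string to its canonical tree; deterministic at every site."""
    depth = math.ceil(math.log2(len(atomstring) + 1))
    # zip stops at the last atom, which drops the remaining (unused) nodes.
    return dict(zip(infix_paths(depth), atomstring))


def infix_key(path):
    """Sort key for infix order: left bit < end of path < right bit."""
    return tuple(0 if bit == "0" else 2 for bit in path) + (1,)


def flatten(tree, root=""):
    """Return the atoms of the non-empty nodes of the subtree rooted at `root`."""
    paths = sorted((p for p in tree if p.startswith(root)), key=infix_key)
    return [tree[p] for p in paths if tree[p] is not None]


small = explode("abc")
canonical = explode("abcdef")
round_trip = "".join(flatten(canonical))
left_subtree = "".join(flatten(canonical, "0"))
```

Note that the canonical tree for "abcdef" differs from the tree of Figure 1, which results from a particular editing history; flattening and then exploding replaces such a history-dependent shape by a balanced one.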

However, these structural operations do not commute with edit operations. We study 
the solution of this problem next. 

6 Mixed tree 

In this section, we study how to combine edit operations and structural operations, while 
still retaining the advantages of a CRDT. 

A first observation is that the explode operation is not really necessary. Algorithm 3 can 
be interpreted as a mapping from a string to a canonical treedoc representation. Applying 
a path to a string implicitly converts the string to the canonical treedoc. Eliminating the 
explicit explode operation removes the need to make it commute with edits. 
A second observation is that flatten is not an essential operation. Aborting a flatten 
(leaving no side-effects) causes no harm. Therefore, if flatten is concurrent with an edit 
operation in the same subtree, we abort it. More precisely, for flatten to take effect, it 
executes a distributed commitment procedure. When executing flatten at some site, if 
this site observes the execution of a concurrent insert or delete, that site votes "No" to 
commitment, otherwise it votes "Yes." The operation succeeds only if all sites vote "Yes," 
otherwise it has no effect. 

Any distributed commitment protocol from the literature will do, for instance two-phase 
commit or Gray and Lamport's fault-tolerant protocol [5]. 
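
The voting rule can be sketched as follows. This is a deliberately simplified illustration, not a full commitment protocol: it models a single failure-free voting round, the `Site` class and `commit_flatten` function are hypothetical, and each site is assumed to know the set of paths edited concurrently with the flatten.

```python
class Site:
    def __init__(self, name, concurrent_edits=()):
        self.name = name
        # Paths of inserts or deletes that this site observed as concurrent
        # with the flatten being voted on.
        self.concurrent_edits = set(concurrent_edits)

    def vote(self, root):
        """Vote "Yes" (True) unless a concurrent edit falls in the subtree at root."""
        return not any(path.startswith(root) for path in self.concurrent_edits)


def commit_flatten(sites, root):
    """The flatten takes effect only if every site votes "Yes"."""
    votes = [site.vote(root) for site in sites]
    return all(votes)


sites = [Site("A"), Site("B", concurrent_edits={"10"}), Site("C")]
left_commits = commit_flatten(sites, "0")     # no edit under [0]
right_commits = commit_flatten(sites, "1")    # site B edited [10]
whole_commits = commit_flatten(sites, "")     # the whole tree contains [10]
```

An aborted flatten leaves the tree unchanged and can simply be retried later, which is why it can run in the background without delaying edits.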

We may now envisage a mixed tree, where parts that are currently being edited are in 
treedoc representation, and parts that are currently quiescent are represented as strings. 

6.1 Fault tolerance and disconnected operation 

Ensuring fault tolerance and disconnected operation for edits is straightforward. 
Every site logs all its operations (whether locally initiated or remote) on persistent 
storage. When a site that was disconnected for some time reconnects with the rest of the 
system, it simply exchanges the missing information with other sites. If a site fails and 
recovers, the situation is the same. If a site crashes, losing its memory, then when it restarts 
it behaves like a new site, and copies over the state of some other site; operations that it 
initiated before the crash and never sent to another site are lost. 

The situation is more complex for flattens, since they require a consensus. To ensure that 
consensus is solvable in the presence of crashes, we assume the existence of fault detectors 
[2]. 

To allow disconnected operation, fault detectors must be capable of distinguishing dis- 
connection from a crash. During the commit phase of flatten, a disconnected site is assumed 
to be voting No, and flatten aborts. This is distinct from a crashed site, which does not 
participate in the commitment. 

Similarly, stabledel should be modified to return true if all non-crashed sites have acknowledged, 
and stableinsert should return true if all non-crashed sites have sent an operation 
that happens-after the insert. 

Note that if a disconnected site is falsely diagnosed as crashed, any operations that it 
initiated within a sub-tree that was flattened cannot be replayed, because they use now- 
forgotten node identities. Such operations are lost. Similarly, if this site initiated operations 
that depend on a node that was deleted and garbage-collected, then these operations are 
lost. 
7 Block edits and transactions

In practice, single-character edit operations are insufficient for concurrent editing. Users 
working on the same portion of text may find it unpleasant to see their edits mixed together. 
Furthermore, common operations such as cutting and pasting, or global replacement, are
block operations. We need transactions, to allow a user to insert, delete, replace or move 
a block of text, without concurrent operations destroying the integrity or location of the 
block. 

Fortunately, a CRDT is ideal for building complex transactions out of simple operations. 
Since individual operations commute when concurrent, concurrent groups of operations commute
as well. To ensure serialisability, it is sufficient to ensure that transactions are executed
sequentially (in any order compatible with happens-before), whether at the initiator site or
during replay.^

A transaction executes atomically (all-or-nothing), indivisibly (its intermediate results 
cannot be observed) and durably (its results are observable by all later operations). We do 
not see any need for nested transactions. 

Since individual operations commute, a transaction never aborts, therefore transaction 
support can be very cheap. All that is needed is some book-keeping of the beginning and 
end of transactions, and buffering received operations to ensure sequential execution. While 
a site is executing a locally-initiated transaction, it buffers remote operations, delaying their 
replay until the end of the transaction. When a site receives a remote transaction, it buffers 
the operations it contains until the end of the transaction is received, and replays them 
all at once. Thus, begin-transaction and end-transaction are basically no-ops used only as
place-holders in the log. 

• begin-transaction opens a transaction. At the initiator site, the transaction will include
all operations initiated at the same site, until the next end-transaction. It is
illegal to initiate two successive begin-transaction operations without an intervening
end-transaction.

• end-transaction closes the current transaction by the same initiator. It is illegal to
initiate an end-transaction unless a transaction is open.
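
The buffering scheme described above can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the `Replica` class, the string messages `"begin"` and `"end"` standing for begin-transaction and end-transaction, and the list used as state are all hypothetical. Delivery in an order compatible with happens-before is assumed to be handled by the layer below.

```python
class Replica:
    """Buffers operations so that transactions execute sequentially."""

    def __init__(self):
        self.state = []            # executed operations, in execution order
        self.in_local_txn = False
        self.delayed = []          # remote transactions delayed by a local one
        self.partial = {}          # sender -> ops of an unfinished remote txn

    def _execute(self, ops):
        self.state.extend(ops)

    # Locally initiated operations
    def begin_transaction(self):
        assert not self.in_local_txn, "a transaction is already open"
        self.in_local_txn = True

    def local_op(self, op):
        self._execute([op])

    def end_transaction(self):
        assert self.in_local_txn, "no transaction is open"
        self.in_local_txn = False
        for ops in self.delayed:   # replay what was delayed meanwhile
            self._execute(ops)
        self.delayed.clear()

    # Remote operations
    def receive(self, sender, msg):
        if msg == "begin":
            self.partial[sender] = []
        elif msg == "end":
            self._deliver(self.partial.pop(sender))
        elif sender in self.partial:
            self.partial[sender].append(msg)
        else:
            self._deliver([msg])   # isolated op: a transaction of itself

    def _deliver(self, ops):
        if self.in_local_txn:
            self.delayed.append(ops)
        else:
            self._execute(ops)


r = Replica()
r.begin_transaction()
r.local_op("L1")
r.receive("s2", "begin")
r.receive("s2", "R1")
r.local_op("L2")
r.receive("s2", "R2")
r.receive("s2", "end")
r.receive("s3", "X")
during = list(r.state)   # only the local operations so far
r.end_transaction()
after = list(r.state)    # remote transaction and isolated op replayed after
```

Nothing in the sketch can abort: begin and end only mark boundaries, and the remote operations are applied as a block once the local transaction has ended.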

Considering any two transactions (or isolated operations), either one happens-before
the other, or they are concurrent. In particular, if a transaction contains an operation that
edits node $N$, and any operation of the same transaction is concurrent with flatten($N'$), and
$N' \mathbin{/^{+}} N \lor N = N'$, then the flatten operation aborts.

Note that we could now define block operations such as a block move or a global search- 
and-replace. From the commutativity perspective, such new operation types are considered 
equivalent to a transaction of insert and delete operations, but they can be implemented 
much more efficiently. 

^For the purpose of sequential execution, an operation that is not part of any transaction is considered 
as a separate transaction of itself. 
8 Related work

A comparison of several approaches to the problem of collaboratively editing a shared text
was written by Ignat et al. [6].

Operational transformation (OT) [22] considers collaborative editing based on non- 
commutative single-character operations. To this end, OT transforms the arguments of 
remote operations to take into account the effects of concurrent executions. OT requires 
two correctness conditions [22]: the transformation should enable concurrent operations to 
execute in either order, and furthermore, transformation functions themselves must com- 
mute. The former is relatively easy. The latter is more complex, and Oster et al. [13] prove 
that all existing transformations violate it. 

OT attempts to make non-commuting operations commute after the fact. We believe 
that a better approach is to design operations to commute in the first place. This is more 
elegant, and avoids the complexities of OT. 

A number of papers study the advantages of commutativity for concurrency and con- 
sistency control [1, 23, for instance]. Systems such as Psync [11], Generalized Paxos [9], 
Generic Broadcast [15] and IceCube [17] make use of commutativity information to relax 
consistency or scheduling requirements. However, these works do not address the issue of 
achieving commutativity. 

Weihl [23] distinguishes between forward and backward commutativity. They differ 
only when operations fail their pre-condition. In this work, we consider only operations that 
succeed at the submission site, and ensure by design that they won't fail at replay sites. 

Roh et al. [18] were the first to suggest the CRDT approach. They give the example of 
an array with a slot assignment operation. To make concurrent assignments commute, they 
propose a deterministic procedure (based on vector clocks) whereby one takes precedence 
over the other. 

This is similar to the well-known Last-Writer-Wins algorithm, used in shared file systems.
Each file replica is timestamped with the time it was last written. Timestamps are consistent
with happens-before [7]. When comparing two versions of the file, the one with the highest
timestamp takes precedence. This is correct with respect to successive writes related by
happens-before, and constitutes a simple precedence rule for concurrent writes.
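
The precedence rule can be illustrated with a short sketch. The `Version` record and the `merge` function are hypothetical names, and breaking timestamp ties by site name is an assumption added so that the outcome is deterministic; the text above specifies only that the highest timestamp wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: int  # assumed consistent with happens-before (e.g. a Lamport clock)
    site: str       # hypothetical tie-breaker for equal timestamps


def merge(a: Version, b: Version) -> Version:
    """Keep the version with the highest (timestamp, site) pair."""
    return max(a, b, key=lambda v: (v.timestamp, v.site))


v1 = Version("draft by A", timestamp=3, site="A")
v2 = Version("draft by B", timestamp=5, site="B")
v3 = Version("draft by C", timestamp=5, site="C")

winner_ab = merge(v1, v2)
winner_ba = merge(v2, v1)
winner_all = merge(merge(v1, v2), v3)
winner_all_rev = merge(v3, merge(v2, v1))
```

The merge gives the same winner in either order, so replicas agree, but the losing versions are simply discarded.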

In the precedence design of Roh et al., concurrent writes to the same location are lost. 
This is inherent to the destructive assignment operation that they consider. In ours, con- 
current inserts are always coalesced, which is important in order to support co-operative 
work. 

In Lamport's replicated state machine approach [7], every replica executes the same 
operations in the same order. This total order is computed either by a consensus algorithm
such as Paxos [8] or, equivalently, by using an atomic broadcast mechanism [3]. Such 
algorithms can tolerate faults. However they are complex and scale poorly; consensus occurs 
within the critical execution path, adding latency to every operation. 
The precedence approach can be viewed as a poor-man's total order. It does not require
an online consensus algorithm, but it loses work.

In the treedoc design, common edit operations execute optimistically, with no latency; it
uses consensus in the background only. Previously, Golding relied on background consensus
for garbage collection [4]. We are not aware of previous instances of background consensus for
structural operations, nor of aborting consensus when it conflicts with essential operations.

9 Conclusion 

It was known previously that commutativity simplifies consistency maintenance, but the
issue of designing systems for commutativity was neglected. This paper suggested a new
paradigm for replication: the Commutative Replicated Data Type or CRDT, designed such 
that concurrent operations commute. We prove that, under some simple and standard execu- 
tion conditions, replicas of any CRDT eventually converge. This makes the implementation 
of replicated systems much simpler than before. Furthermore, CRDTs support transactions 
at very low cost. 

However, designing a CRDT with the desirable property that no work is lost (coales- 
cence) is not easy. We give a coalescing CRDT solution to the problem of a shared edit buffer, 
by implementing some known techniques in a novel way (using paths in a binary tree as 
invariant identifiers) and by some new techniques (abortable consensus in the background). 
This is possible only because updates are not destructive. 

Our techniques are not limited to this particular problem, and are generalisable to
non-destructive updates in other data structures, such as directories.

We purposely designed treedoc to support arbitrary mixtures of edit operations. We
separate out the issue of semantic constraints and conflict detection, which we study elsewhere
[12, 17, 20, 21]. We interpret conflicts as cases of irreducible non-commutativity. Any
real system must support a mix of data types, some coalescing, some using precedence, and 
some not commutative. 

Our next step in this research will be to enable peer-to-peer co-operative editing at a 
large scale, by implementing treedoc within an existing text editor or wiki system. This will 
enable a deeper investigation of pragmatic issues and performance studies. 

A Proof of eventual consistency 

We prove the following property. Assuming: 

• That every operation, initiated at any site, eventually executes at all sites, 

• That if $o \to o'$, then $o$ executes before $o'$ at every site,

• That all concurrent operations commute,

then the state of the object eventually converges at all sites.
Our proof is by recurrence over all legal execution schedules. 

A.1 Notation

A multilog $M = (V, H)$ is a directed graph, consisting of a set of operations $V = \{\alpha, \beta, \ldots\}$
connected by the happens-before relation $H = \{(x, y) \in V \times V : x \to y\}$.

A schedule $T = (\alpha \cdot \beta \cdot \ldots)$ of a multilog is a sequential enumeration of operations
in some order consistent with happens-before. Formally, $T = (L, <_T) \in \mathit{sched}((V, H)) \iff
\forall x, y \in L,\ (x, y \in V) \land (x <_T y \Rightarrow x \neq y) \land (x \to y \Rightarrow x <_T y)$. The sequence operator is
noted $\cdot$.

A schedule $T = (L, <_T) \in \mathit{sched}((V, H))$ is said to be complete with respect to its multilog
iff it contains all operations, i.e., iff $L = V$.

Operation $v$ extends schedule $T \in \mathit{sched}(M)$ if $(T \cdot v)$ is in $\mathit{sched}(M)$.

A quasi-state is a schedule $T \in \mathit{sched}(M)$ whose initial element is the distinguished
operation INIT (denoting the common initial state). We assume that we can further distinguish
(in some application-specific manner) between illegal and legal quasi-states. Any extension
of an illegal quasi-state is itself illegal. A legal quasi-state will be called a state henceforth:
$T \in \mathit{state}(M)$. We assume the existence of an equivalence relation $\equiv$ between states. Like
legality, equivalence is application dependent.

A.2 Recurrence proof

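We require that all concurrent operations commute, i.e., given any state $T$ and concurrent
operations $x$ and $y$ that extend $T$, the sequences $(T \cdot x \cdot y)$ and $(T \cdot y \cdot x)$ are states and are
equivalent. Formally:

\[
\forall T \in \mathit{state}((V, H)),\ \forall x, y \in V :\quad
x \parallel y \land (T \cdot x) \in \mathit{state}((V, H)) \land (T \cdot y) \in \mathit{state}((V, H))
\]
\[
\Rightarrow\ (T \cdot x \cdot y) \in \mathit{state}((V, H)) \land (T \cdot y \cdot x) \in \mathit{state}((V, H)) \land (T \cdot x \cdot y) \equiv (T \cdot y \cdot x)
\]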

Given a set of natural numbers $N = \{1, 2, \ldots, n-1\}$, we note $p$ some permutation of
$N$, with elements $p(1), p(2), \ldots, p(n-1)$.

Theorem 2 (All complete states of $M$ are equivalent). Let $M = (V, H)$ be a multilog
of size $|V| = n$. Let $T = (\mathrm{INIT} \cdot \alpha_1 \cdot \ldots \cdot \alpha_{n-1})$ be a complete state of $M$, i.e.,
$T \in \mathit{state}(M) \land \{\mathrm{INIT}, \alpha_1, \ldots, \alpha_{n-1}\} = V$. Let $p$ be some arbitrary permutation of $\{1, 2, \ldots, n-1\}$,
and let $T_p$ denote the sequence $(\mathrm{INIT} \cdot \alpha_{p(1)} \cdot \ldots \cdot \alpha_{p(n-1)})$. Then, if $T_p$ is a state, it is equivalent
to $T$: $T_p \in \mathit{state}(M) \Rightarrow T \equiv T_p$.
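
The statement can be checked by brute force on a tiny example. The sketch below is purely illustrative and rests on assumptions: the three operations, the happens-before pair, and the choice of "add an element to a set" as the commuting operation are all hypothetical. It enumerates every schedule consistent with happens-before and compares the final states.

```python
from itertools import permutations

# Hypothetical example: each operation adds an element to a set,
# so any two operations commute.
ops = {"a": "x", "b": "y", "c": "z"}
# Happens-before pairs assumed for the example: a -> c.
happens_before = {("a", "c")}


def is_schedule(order):
    """True if `order` is consistent with happens-before."""
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[x] < pos[y] for x, y in happens_before)


def run(order):
    """Execute the operations in sequence, starting from the INIT state."""
    state = frozenset()  # INIT: the empty set
    for op in order:
        state = state | {ops[op]}
    return state


schedules = [p for p in permutations(ops) if is_schedule(p)]
final_states = {run(s) for s in schedules}
```

Every complete schedule that respects happens-before ends in the same state here, as the theorem predicts for commuting operations; such a check is a sanity test on one example, not a substitute for the proof.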
The proof is by recurrence. The theorem is obviously true for $n = 1$ and $n = 2$. Assume it
is true for arbitrary $n$; we shall prove that it remains true for $n+1$. Let $T = (\mathrm{INIT} \cdot \alpha_1 \cdot \ldots \cdot \alpha_{n-1})$
be a complete state of $M = (V, H)$ where $n = |V|$.

Consider $M' = (V', H')$ such that $M \subset M' \land |V'| = n + 1$. $V'$ and $V$ differ by a single
element, $\beta$. With no loss of generality, we assume that $\beta$ does not happen before any element
of $V$, i.e., $\nexists x \in V, (\beta, x) \in H'$.^ Note $T' = (T \cdot \beta)$.

If $T' \notin \mathit{state}(M')$ the theorem is trivially true, because $T'$ is not a state. Therefore,
assume $T' \in \mathit{state}(M')$. It follows that $T'$ is a complete state of $M'$.

For some permutation $p$, let $T'_p = (\mathrm{INIT} \cdot \alpha_{p(1)} \cdot \ldots \cdot \alpha_{p(n-1)} \cdot \beta)$. Consider now the set
of sequences derived from $T'_p$ by permuting the position of $\beta$:

\[
\begin{array}{l}
\mathrm{INIT} \,;\, \alpha_{p(1)} \,;\, \alpha_{p(2)} \,;\, \ldots \,;\, \alpha_{p(n-2)} \,;\, \alpha_{p(n-1)} \,;\, \beta \\
\mathrm{INIT} \,;\, \alpha_{p(1)} \,;\, \alpha_{p(2)} \,;\, \ldots \,;\, \alpha_{p(n-2)} \,;\, \beta \,;\, \alpha_{p(n-1)} \\
\mathrm{INIT} \,;\, \alpha_{p(1)} \,;\, \alpha_{p(2)} \,;\, \ldots \,;\, \beta \,;\, \alpha_{p(n-2)} \,;\, \alpha_{p(n-1)} \\
\qquad \vdots \\
\mathrm{INIT} \,;\, \alpha_{p(1)} \,;\, \beta \,;\, \ldots \,;\, \alpha_{p(n-3)} \,;\, \alpha_{p(n-2)} \,;\, \alpha_{p(n-1)} \\
\mathrm{INIT} \,;\, \beta \,;\, \alpha_{p(1)} \,;\, \ldots \,;\, \alpha_{p(n-3)} \,;\, \alpha_{p(n-2)} \,;\, \alpha_{p(n-1)}
\end{array}
\]

If the sequence on any line is not a state, then none of the following lines is a state
either; for these, the theorem is trivially true.

The first line is precisely $T'_p$. If $T'_p \in \mathit{state}(M')$, then, by assumption, $T'_p = (T_p \cdot \beta) \equiv
(T \cdot \beta) = T'$.

Now examine the second line. Either $\alpha_{p(n-1)} \parallel \beta$, and they commute, and therefore it
is a state equivalent to the first line, and hence to $T'$; or $\alpha_{p(n-1)} \to \beta$, and the second line
is not a state. Similarly for all the following lines: each sequence is either not a state, or is
a state equivalent to $T'$.

Thus we have proven the recurrence clause for all permutations of $\{1, 2, \ldots, n\}$ that are
in the same order as $p$. Furthermore, since the recurrence clause is true for all permutations
$p$ of $\{1, 2, \ldots, n-1\}$, it is true for all permutations of $\{1, 2, \ldots, n\}$. QED.

^In other words, the operations are sorted in $\to$ order before constructing the successive multilogs.

A.3 Eventual consistency

The above proves that, if different sites execute schedules consisting of the same set of
operations, the order of every schedule is consistent with happens-before, and concurrent
operations commute, then their final states are equivalent.

If all clients stop initiating operations, and assuming that the system transmits and
executes every operation at all sites, then all replicas converge to the same state. This is
the traditional definition of Eventual Consistency. Our result is actually stronger, since the
final state is a correct state, and it includes all the submitted operations.

However, in a practical system, clients don't stop initiating operations. We can prove 
nonetheless that, for any time t, the state at every site eventually includes an equivalent
prefix containing all operations up to t. Assume that initiating an operation is atomic. 
Consider the set O of operations initiated at all sites up to time t, and the set O' of 
operations initiated after t. (As sites do not have access to a common clock, these sets 
cannot be computed, but they exist nonetheless.) 

Any operation $o \in O$ is either concurrent with, or happens-before, any operation $o' \in O'$. If
$o \parallel o'$, then $(o \cdot \ldots \cdot o') \equiv (o' \cdot \ldots \cdot o)$; we need consider only the former. If $o \to o'$, then
the only legal order is $(o \cdot \ldots \cdot o')$. The operations in $O$ are eventually executed at all sites,
and the operations in $O'$ execute after. Thus the state at all sites has as common prefix the
operations in $O$.